In [2]:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd
Then, read the (sample) input tables for matching purposes.
In [3]:
# Get the datasets directory
datasets_dir = em.get_install_path() + os.sep + 'datasets'
path_A = datasets_dir + os.sep + 'dblp_demo.csv'
path_B = datasets_dir + os.sep + 'acm_demo.csv'
path_labeled_data = datasets_dir + os.sep + 'labeled_data_demo.csv'
In [5]:
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
# Load the pre-labeled data
S = em.read_csv_metadata(path_labeled_data,
                         key='_id',
                         ltable=A, rtable=B,
                         fk_ltable='ltable_id', fk_rtable='rtable_id')
S.head()
Out[5]:
|   | _id | ltable_id | rtable_id | ltable_title | ltable_authors | ltable_year | rtable_title | rtable_authors | rtable_year | label |
| 0 | 0 | l1223 | r498 | Dynamic Information Visualization | Yannis E. Ioannidis | 1996 | Dynamic information visualization | Yannis E. Ioannidis | 1996 | 1 |
| 1 | 1 | l1563 | r1285 | Dynamic Load Balancing in Hierarchical Parallel Database Systems | Luc Bouganim, Daniela Florescu, Patrick Valduriez | 1996 | Dynamic Load Balancing in Hierarchical Parallel Database Systems | Luc Bouganim, Daniela Florescu, Patrick Valduriez | 1996 | 1 |
| 2 | 2 | l1514 | r1348 | Query Processing and Optimization in Oracle Rdb | Gennady Antoshenkov, Mohamed Ziauddin | 1996 | prospector: a content-based multimedia server for massively parallel architectures | S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader | 1996 | 0 |
| 3 | 3 | l206 | r1641 | An Asymptotically Optimal Multiversion B-Tree | Thomas Ohler, Peter Widmayer, Bruno Becker, Stephan Gschwind, Bernhard Seeger | 1996 | A complete temporal relational algebra | Debabrata Dey, Terence M. Barron, Veda C. Storey | 1996 | 0 |
| 4 | 4 | l1589 | r495 | Evaluating Probabilistic Queries over Imprecise Data | Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar | 2003 | Evaluating probabilistic queries over imprecise data | Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar | 2003 | 1 |
Then, split the labeled data into a development set (I) and an evaluation set (J). The development set is normally used to choose and tune a matcher; in this example the rule-based matcher is later applied to the full labeled set S.
In [6]:
# Split S into I and J
IJ = em.split_train_test(S, train_proportion=0.5, random_state=0)
I = IJ['train']
J = IJ['test']
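As a rough illustration of what a reproducible 50/50 split does, the sketch below shuffles row identifiers with a fixed seed and cuts the list in half. It uses only the standard library and is not how em.split_train_test is implemented; the helper name split_rows and the variable IJ_sketch are hypothetical.

```python
import random

def split_rows(rows, train_proportion=0.5, random_state=0):
    """Shuffle rows reproducibly, then cut them into train and test lists."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle independent of global state
    random.Random(random_state).shuffle(shuffled)
    cut = int(len(shuffled) * train_proportion)
    return {'train': shuffled[:cut], 'test': shuffled[cut:]}

# Stand-in for the 450 labeled pairs, identified only by their _id values
IJ_sketch = split_rows(range(450), train_proportion=0.5, random_state=0)
print(len(IJ_sketch['train']), len(IJ_sketch['test']))
```

Fixing the seed means the same rows land in each half on every run, which keeps later comparisons between matchers fair.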
In [7]:
brm = em.BooleanRuleMatcher()
In [8]:
# Generate a set of features
F = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)
We observe that 20 features were generated; they are listed below. The rules defined later use only the title, authors, and year features.
In [9]:
F.feature_name
Out[9]:
0 id_id_lev_dist
1 id_id_lev_sim
2 id_id_jar
3 id_id_jwn
4 id_id_exm
5 id_id_jac_qgm_3_qgm_3
6 title_title_jac_qgm_3_qgm_3
7 title_title_cos_dlm_dc0_dlm_dc0
8 title_title_mel
9 title_title_lev_dist
10 title_title_lev_sim
11 authors_authors_jac_qgm_3_qgm_3
12 authors_authors_cos_dlm_dc0_dlm_dc0
13 authors_authors_mel
14 authors_authors_lev_dist
15 authors_authors_lev_sim
16 year_year_exm
17 year_year_anm
18 year_year_lev_dist
19 year_year_lev_sim
Name: feature_name, dtype: object
Before we can use the rule-based matcher, we need to create rules to evaluate tuple pairs. Each rule is a list of strings, and each string specifies one predicate; the rule is the conjunction of its predicates. Each predicate has three parts: (1) an expression, (2) a comparison operator, and (3) a value. The expression is evaluated over a tuple pair, producing a numeric value.
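To make the three parts of a predicate concrete, the sketch below pulls a predicate string apart with a regular expression. This is only an illustration of the string format used in this notebook, not the parser py_entitymatching uses internally; the names PREDICATE_RE and split_predicate are hypothetical.

```python
import re

# Assumed shape: feature_name(ltuple, rtuple) <comparison operator> <number>
PREDICATE_RE = re.compile(
    r'^\s*(?P<feature>\w+)\(ltuple,\s*rtuple\)\s*'
    r'(?P<op>==|!=|>=|<=|>|<)\s*'
    r'(?P<value>-?\d+(?:\.\d+)?)\s*$'
)

def split_predicate(predicate):
    """Return (expression name, comparison operator, numeric value)."""
    m = PREDICATE_RE.match(predicate)
    if m is None:
        raise ValueError('not a recognised predicate: %r' % predicate)
    return m.group('feature'), m.group('op'), float(m.group('value'))

print(split_predicate('title_title_lev_sim(ltuple, rtuple) > 0.4'))
print(split_predicate('year_year_exm(ltuple, rtuple) == 1'))
```

The expression name corresponds to an entry in the feature table F, which is why the feature table is passed alongside the rule.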
In [10]:
# Add two rules to the rule-based matcher
# The first rule has two predicates, one comparing the titles and the other looking for an exact match of the years
brm.add_rule(['title_title_lev_sim(ltuple, rtuple) > 0.4', 'year_year_exm(ltuple, rtuple) == 1'], F)
# This second rule compares the authors
brm.add_rule(['authors_authors_lev_sim(ltuple, rtuple) > 0.4'], F)
brm.get_rule_names()
Out[10]:
['_rule_0', '_rule_1']
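To get a feel for the 0.4 threshold used in these rules, here is a minimal sketch of a normalized Levenshtein similarity, assuming the common definition 1 - distance / max(length). Whether the library's lev_sim feature normalizes or preprocesses strings in exactly this way is an assumption; the functions below are stand-ins written for illustration.

```python
def lev_dist(s, t):
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def lev_sim(s, t):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - lev_dist(s, t) / longest

# Titles from the first labeled pair; they differ only in letter case
a = 'Dynamic Information Visualization'
b = 'Dynamic information visualization'
print(lev_dist(a, b), round(lev_sim(a, b), 3))
```

Under this definition the two titles differ by two substitutions out of 33 characters, so their similarity is well above 0.4.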
In [11]:
# Rules can also be deleted from the rule-based matcher
brm.delete_rule('_rule_1')
Out[11]:
True
Now that our rule-based matcher has some rules, we can use it to predict whether a tuple pair is actually a match. Each rule is a conjunction of predicates and returns True only if all of its predicates return True. The matcher is a disjunction of rules: if any one of the rules returns True, the tuple pair is predicted to be a match.
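The conjunction and disjunction logic can be sketched in a few lines of plain Python. This is a simplified model, not the BooleanRuleMatcher implementation: predicates are written as (feature, operator, value) tuples, and the feature values are hypothetical numbers rather than values computed from the tables.

```python
import operator

OPS = {'>': operator.gt, '<': operator.lt, '>=': operator.ge,
       '<=': operator.le, '==': operator.eq, '!=': operator.ne}

# One rule, mirroring the rule left in the matcher after the deletion above
rules = [
    [('title_title_lev_sim', '>', 0.4), ('year_year_exm', '==', 1)],
]

def rule_fires(rule, feature_values):
    """A rule is a conjunction: every predicate must hold."""
    return all(OPS[op](feature_values[name], value) for name, op, value in rule)

def is_match(rules, feature_values):
    """The matcher is a disjunction: one firing rule is enough."""
    return any(rule_fires(rule, feature_values) for rule in rules)

# Hypothetical feature values for two tuple pairs
similar_pair = {'title_title_lev_sim': 0.94, 'year_year_exm': 1}
different_pair = {'title_title_lev_sim': 0.15, 'year_year_exm': 1}
print(is_match(rules, similar_pair), is_match(rules, different_pair))
```

With a single rule, the disjunction reduces to that rule; adding a second rule can only turn more pairs into predicted matches, never fewer.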
In [12]:
brm.predict(S, target_attr='pred_label', append=True)
S
Out[12]:
|   | _id | ltable_id | rtable_id | ltable_title | ltable_authors | ltable_year | rtable_title | rtable_authors | rtable_year | label | pred_label |
| 0 | 0 | l1223 | r498 | Dynamic Information Visualization | Yannis E. Ioannidis | 1996 | Dynamic information visualization | Yannis E. Ioannidis | 1996 | 1 | 1 |
| 1 | 1 | l1563 | r1285 | Dynamic Load Balancing in Hierarchical Parallel Database Systems | Luc Bouganim, Daniela Florescu, Patrick Valduriez | 1996 | Dynamic Load Balancing in Hierarchical Parallel Database Systems | Luc Bouganim, Daniela Florescu, Patrick Valduriez | 1996 | 1 | 1 |
| 2 | 2 | l1514 | r1348 | Query Processing and Optimization in Oracle Rdb | Gennady Antoshenkov, Mohamed Ziauddin | 1996 | prospector: a content-based multimedia server for massively parallel architectures | S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader | 1996 | 0 | 0 |
| 3 | 3 | l206 | r1641 | An Asymptotically Optimal Multiversion B-Tree | Thomas Ohler, Peter Widmayer, Bruno Becker, Stephan Gschwind, Bernhard Seeger | 1996 | A complete temporal relational algebra | Debabrata Dey, Terence M. Barron, Veda C. Storey | 1996 | 0 | 0 |
| 4 | 4 | l1589 | r495 | Evaluating Probabilistic Queries over Imprecise Data | Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar | 2003 | Evaluating probabilistic queries over imprecise data | Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar | 2003 | 1 | 1 |
| 5 | 5 | l43 | r1415 | Optimization of Run-time Management of Data Intensive Web-sites | Khaled Yagoub, Dan Suciu, Alon Y. Levy, Daniela Florescu | 1999 | On random sampling over joins | Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya | 1999 | 0 | 0 |
| 6 | 6 | l1466 | r1348 | Access Path Support for Referential Integrity in SQL2 | Joachim Reinert, Theo Hrder | 1996 | prospector: a content-based multimedia server for massively parallel architectures | S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader | 1996 | 0 | 0 |
| 7 | 7 | l1535 | r1800 | Mariposa: A Wide-Area Distributed Database System | Carl Staelin, Paul M. Aoki, Witold Litwin, Michael Stonebraker, Adam Sah, Jeff Sidell, Andrew Yu... | 1996 | Further Improvements on Integrity Constraint Checking for Stratifiable Deductive Databases | Sin Yeung Lee, Tok Wang Ling | 1996 | 0 | 0 |
| 8 | 8 | l1317 | r1676 | QuickStore: A High Performance Mapped Object Store | David J. DeWitt, Seth J. White | 1994 | An Overview of Repository Technology | Philip A. Bernstein, Umeshwar Dayal | 1994 | 0 | 0 |
| 9 | 9 | l621 | r175 | Communication Efficient Distributed Mining of Association Rules | Ran Wolff, Assaf Schuster | 2001 | Editorial | Richard Snodgrass | 2001 | 0 | 0 |
| 10 | 10 | l668 | r1694 | Indexing Multimedia Databases (Tutorial) | Christos Faloutsos | 1995 | Information finding in a digital library: the Stanford perspective | Tak W. Yan, Héctor García-Molina | 1995 | 0 | 0 |
| 11 | 11 | l1189 | r1674 | Weimin Du, Xiangning Liu, Abdelsalam Helal | Multiview Access Protocols for Large-Scale Replication | 1998 | Multiview access protocols for large-scale replication | Xiangning Liu, Abdelsalam Helal, Weimin Du | 1998 | 1 | 0 |
| 12 | 12 | l1657 | r110 | Semantic B2B Integration | Christoph Bussler | 2001 | Monitoring business processes through event correlation based on dependency model | Asaf Adii, David Botzer, Opher Etzion, Tali Yatzkar-Haham | 2001 | 0 | 0 |
| 13 | 13 | l1490 | r599 | Extracting Large Data Sets using DB2 Parallel Edition | Sriram Padmanabhan | 1996 | Extracting Large Data Sets using DB2 Parallel Edition | Sriram Padmanabhan | 1996 | 1 | 1 |
| 14 | 14 | l595 | r87 | Of Crawlers, Portals, Mice and Men: Is there more to Mining the Web? (Panel) | Kyuseok Shim, Rajeev Rastogi, Minos N. Garofalakis, Sridhar Ramaswamy | 1999 | Of crawlers, portals, mice, and men: is there more to mining the Web? | Minos N. Garofalakis, Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim | 1999 | 1 | 1 |
| 15 | 15 | l380 | r1337 | Outerjoin Simplification and Reordering for Query Optimization | Csar A. Galindo-Legaria, Arnon Rosenthal | 1997 | Outerjoin simplification and reordering for query optimization | César Galindo-Legaria, Arnon Rosenthal | 1997 | 1 | 1 |
| 16 | 16 | l165 | r1118 | Cache-and-Query for Wide Area Sensor Databases | Phillip B. Gibbons, Srinivasan Seshan, Suman Kumar Nath, Amol Deshpande | 2003 | Cache-and-query for wide area sensor databases | Amol Deshpande, Suman Nath, Phillip B. Gibbons, Srinivasan Seshan | 2003 | 1 | 1 |
| 17 | 17 | l796 | r588 | Generating Dynamic Content at Database-Backed Web Servers: cgi-bin vs. mod_perl | Alexandros Labrinidis, Nick Roussopoulos | 2000 | Novel Approaches in Query Processing for Moving Object Trajectories | Dieter Pfoser, Christian S. Jensen, Yannis Theodoridis | 2000 | 0 | 0 |
| 18 | 18 | l1160 | r1733 | Khaled Alsabti, Vineet Singh, Sanjay Ranka | A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data | 1997 | A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data | Khaled Alsabti, Sanjay Ranka, Vineet Singh | 1997 | 1 | 0 |
| 19 | 19 | l1752 | r3 | SHORE: Combining the Best Features of OODBMS and File Systems | Shore Team | 1995 | The LyriC language: querying constraint objects | Alexander Brodsky, Yoram Kornatzky | 1995 | 0 | 0 |
| 20 | 20 | l1647 | r945 | Cost Based Query Scrambling for Initial Delays | Tolga Urhan, Michael J. Franklin, Laurent Amsaleg | 1998 | The Cubetree Storage Organization | Nick Roussopoulos, Yannis Kotidis | 1998 | 0 | 0 |
| 21 | 21 | l1135 | r1127 | Sampling-Based Estimation of the Number of Distinct Values of an Attribute | Peter J. Haas, Lynne Stokes, S. Seshadri, Jeffrey F. Naughton | 1995 | View maintenance in a warehousing environment | Yue Zhuge, Héctor García-Molina, Joachim Hammer, Jennifer Widom | 1995 | 0 | 0 |
| 22 | 22 | l1776 | r987 | Walking Through a Very Large Virtual Environment in Real-time | Yixin Ruan, Kian-Lee Tan, Jason Chionh, Lidan Shou, Zhiyong Huang | 2001 | Walking Through a Very Large Virtual Environment in Real-time | Lidan Shou, Jason Chionh, Zhiyong Huang, Yixin Ruan, Kian-Lee Tan | 2001 | 1 | 1 |
| 23 | 23 | l676 | r1395 | Datawarehousing Has More Colours Than Just Black & White | Thomas Zurek, Markus Sinnwell | 1999 | Datawarehousing Has More Colours Than Just Black &; White | Thomas Zurek, Markus Sinnwell | 1999 | 1 | 1 |
| 24 | 24 | l1087 | r648 | The Grid: An Application of the Semantic Web | Carole A. Goble, David De Roure | 2002 | An XML query engine for network-bound data | Zachary G. Ives, A. Y. Halevy, D. S. Weld | 2002 | 0 | 0 |
| 25 | 25 | l629 | r1478 | Engineering Federated Information Systems: Report of EFIS '99 Workshop | Flix Saltor, Uwe Hohenstein, Ralf-Detlef Kutsche, Wilhelm Hasselbring, Gunter Saake, Stefan Conr... | 1999 | Engineering federated information systems: report of EEFIS '99 workshop | S. Conrad, W. Hasselbring, U. Hohenstein, R.-D. Kutsche, M. Roantree, G. Saake, F. Saltor | 1999 | 1 | 1 |
| 26 | 26 | l649 | r1366 | Random Sampling for Histogram Construction: How much is enough? | Vivek R. Narasayya, Rajeev Motwani, Surajit Chaudhuri | 1998 | Random sampling for histogram construction: how much is enough? | Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya | 1998 | 1 | 1 |
| 27 | 27 | l211 | r1490 | BeSS: Storage Support for Interactive Visualization Systems | William O'Connell, Thomas A. Funkhouser, Alexandros Biliris, Euthimios Panagos | 1996 | BeSS: storage support for interactive visualization systems | A. Biliris, T. A. Funkhouser, W. O'Connell, E. Panagos | 1996 | 1 | 1 |
| 28 | 28 | l734 | r384 | Min-Max Compression Methods for Medical Image Databases | John M. Tyler, Kosmas Karadimitriou | 1997 | Min-max compression methods for medical image databases | Kosmas Karadimitriou, John M. Tyler | 1997 | 1 | 1 |
| 29 | 29 | l611 | r141 | Mining Generalized Association Rules | Ramakrishnan Srikant, Rakesh Agrawal | 1995 | Multi-table joins through bitmapped join indices | Patrick O'Neil, Goetz Graefe | 1995 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 420 | 420 | l834 | r883 | Estimating the Selectivity of XML Path Expressions for Internet Scale Applications | Ashraf Aboulnaga, Jeffrey F. Naughton, Alaa R. Alameldeen | 2001 | Estimating the Selectivity of XML Path Expressions for Internet Scale Applications | Ashraf Aboulnaga, Alaa R. Alameldeen, Jeffrey F. Naughton | 2001 | 1 | 1 |
| 421 | 421 | l746 | r301 | Providing Database Migration Tools - A Practicioner's Approach | Andreas Meier | 1995 | Providing Database Migration Tools - A Practicioner's Approach | Andreas Meier | 1995 | 1 | 1 |
| 422 | 422 | l1332 | r619 | Workshop on Workflow Management in Scientific and Engineering Applications - Report | Gottfried Vossen, Richard McClatchey | 1997 | Workshop on workflow management in scientific and engineering applications-report | R. McClatchey, G. Vossen | 1997 | 1 | 1 |
| 423 | 423 | l942 | r1473 | Research in Databases and Data-Intensive Applications - Computer Science Department and FZI, Uni... | Birgitta Knig-Ries, Peter C. Lockemann | 1997 | Research in databases and data-intensive applications: Computer Science Dept. and FIZ, Universit... | Brigitta König-Ries, Peter C. Lockermann | 1997 | 1 | 1 |
| 424 | 424 | l806 | r356 | Tribeca: A Stream Database Manager for Network Traffic Analysis | Mark Sullivan | 1996 | Type-safe relaxing of schema consistency rules for flexible modelling in OODBMS | Eric Amiel, Marie-Jo Bellosta, Eric Dujardin, Eric Simon | 1996 | 0 | 0 |
| 425 | 425 | l794 | r784 | Spatial Data Management for Computer Aided Design | Andreas Mller, Marco Ptke, Thomas Seidl, Hans-Peter Kriegel | 2001 | Dynamic content acceleration: a caching solution to enable scalable dynamic Web page generation | Anindya Datta, Kaushik Dutta, Krithi Ramamritham, Helen Thomas, Debra VanderMeer | 2001 | 0 | 0 |
| 426 | 426 | l28 | r1618 | Storage Technology: RAID and Beyond | Garth A. Gibson | 1995 | Tutorial on storage technology: RAID and beyond | Garth A. Gibson | 1995 | 1 | 1 |
| 427 | 427 | l1183 | r1409 | Stephen Blott, Roger Weber, Hans-Jrg Schek | A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional ... | 1998 | A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional ... | Roger Weber, Hans-Jörg Schek, Stephen Blott | 1998 | 1 | 0 |
| 428 | 428 | l1122 | r232 | Interview with Jim Gray | Marianne Winslett | 2003 | In-context peer-to-peer information filtering on the Web | Aris M. Ouksel | 2003 | 0 | 0 |
| 429 | 429 | l1430 | r1444 | Condition Handling in SQL Persistent Stored Modules | Jeff Richey | 1995 | Condition handling in SQL persistent stored modules | Jeff Richey | 1995 | 1 | 1 |
| 430 | 430 | l1494 | r1257 | The Mariposa Distributed Database Management System | Jeff Sidell | 1996 | Open issues in parallel query optimization | Waqar Hasan, Daniela Florescu, Patrick Valduriez | 1996 | 0 | 0 |
| 431 | 431 | l1592 | r439 | Report on the 18th British National Conference on Databases (BNCOD) | Carole A. Goble, Brian J. Read | 2002 | Contracting in the days of eBusiness | W. Hümmer, W. Lehner, H. Wedekind | 2002 | 0 | 0 |
| 432 | 432 | l1015 | r45 | Database Systems - Breaking Out of the Box | Abraham Silberschatz, Stanley B. Zdonik | 1997 | Dynamic Memory Adjustment for External Mergesort | Weiye Zhang, Per-Åke Larson | 1997 | 0 | 0 |
| 433 | 433 | l1147 | r1016 | Xiaolei Qian | Scientist's Called Upon to Take Actions | 1996 | Scientists called upon to take actions | Xiaolei Qian | 1996 | 1 | 0 |
| 434 | 434 | l1756 | r310 | ARIES/CSA: A Method for Database Recovery in Client-Server Architectures | C. Mohan, Inderpal Narang | 1994 | Enterprise information architectures-they're finally changing | Wesley P. Melling | 1994 | 0 | 0 |
| 435 | 435 | l1044 | r67 | Digital Library Services in Mobile Computing | Evaggelia Pitoura, Melliyal Annamalai, Bharat K. Bhargava | 1995 | Ordered shared locks for real-time databases | Divyakant Agrawal, Amr El Abbadi, Richard Jeffers, Lijing Lin | 1995 | 0 | 0 |
| 436 | 436 | l412 | r651 | Phoenix: Making Applications Robust | David B. Lomet, Roger S. Barga | 1999 | DataBlitz storage manager: main-memory database performance for critical applications | J. Baulier, P. Bohannon, S. Gogate, C. Gupta, S. Haldar | 1999 | 0 | 0 |
| 437 | 437 | l796 | r1808 | Generating Dynamic Content at Database-Backed Web Servers: cgi-bin vs. mod_perl | Alexandros Labrinidis, Nick Roussopoulos | 2000 | On wrapping query languages and efficient XML integration | Vassilis Christophides, Sophie Cluet, Jérǒme Simèon | 2000 | 0 | 0 |
| 438 | 438 | l1570 | r1468 | Instance-based attribute identification in database integration | Roger H. L. Chiang, Ee-Peng Lim, Chua Eng Huang Cecil | 2003 | Index-driven similarity search in metric spaces | Gisli R. Hjaltason, Hanan Samet | 2003 | 0 | 0 |
| 439 | 439 | l1577 | r688 | Data Mining Using Two-Dimensional Optimized Accociation Rules: Scheme, Algorithms, and Visualiza... | Shinichi Morishita, Yasuhiko Morimoto, Takeshi Tokuyama, Takeshi Fukuda | 1996 | Static detection of security flaws in object-oriented databases | Keishi Tajima | 1996 | 0 | 0 |
| 440 | 440 | l617 | r310 | Fine-Grained Sharing in a Page Server OODBMS | Michael J. Carey, Markos Zaharioudakis, Michael J. Franklin | 1994 | Enterprise information architectures-they're finally changing | Wesley P. Melling | 1994 | 0 | 0 |
| 441 | 441 | l1304 | r1178 | Query Rewriting for Semistructured Data | Vasilis Vassalos, Yannis Papakonstantinou | 1999 | The Aqua approximate query answering system | Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy | 1999 | 0 | 0 |
| 442 | 442 | l727 | r597 | Design and Analysis of Parametric Query Optimization Algorithms | Sumit Ganguly | 1998 | Incremental distance join algorithms for spatial databases | Gísli R. Hjaltason, Hanan Samet | 1998 | 0 | 0 |
| 443 | 443 | l1205 | r395 | Proxy-Server Architectures for OLAP | Panos Kalnis, Dimitris Papadias | 2001 | Proxy-server architectures for OLAP | Panos Kalnis, Dimitris Papadias | 2001 | 1 | 1 |
| 444 | 444 | l915 | r1532 | Efficient k-NN search on vertically decomposed data | Niels Nes, Martin L. Kersten, Nikos Mamoulis, Arjen P. de Vries | 2002 | Efficient k-NN search on vertically decomposed data | Arjen P. de Vries, Nikos Mamoulis, Niels Nes, Martin Kersten | 2002 | 1 | 1 |
| 445 | 445 | l365 | r53 | 50,000 Users on an Oracle8 Universal Server Database | Ashok Josji, Tirthankar Lahiri, Amit Jasuja, Sumanta Chatterjee | 1998 | A workflow-based electronic marketplace on the Web | Asuman Dogac, Ilker Durusoy, Sena Arpinar, Nesime Tatbul, Pinar Koksal, Ibrahim Cingil, Nazife D... | 1998 | 0 | 0 |
| 446 | 446 | l458 | r767 | Comparing Hierarchical Data in External Memory | Sudarshan S. Chawathe | 1999 | Context-Based Prefetch for Implementing Objects on Relations | Philip A. Bernstein, Shankar Pal, David Shutt | 1999 | 0 | 0 |
| 447 | 447 | l655 | r412 | The SDSS skyserver: public access to the sloan digital sky server data | Tanu Malik, Jordan Raddick, Alexander S. Szalay, Peter Z. Kunszt, Jim Gray, Christopher Stoughto... | 2002 | Report on the ACM fourth international workshop on data warehousing and OLAP (DOLAP 2001) | Joachim Hammer | 2002 | 0 | 0 |
| 448 | 448 | l123 | r1493 | Change-Centric Management of Versions in an XML Warehouse | Laurent Mignet, Amlie Marian, Gregory Cobena, Serge Abiteboul | 2001 | A Sequential Pattern Query Language for Supporting Instant Data Mining for e-Services | Reza Sadri, Carlo Zaniolo, Amir M. Zarkesh, Jafar Adibi | 2001 | 0 | 0 |
| 449 | 449 | l590 | r295 | Skew handling techniques in sort-merge join | Richard T. Snodgrass, Wei Li, Dengfeng Gao | 2002 | QURSED: querying and reporting semistructured data | Yannis Papakonstantinou, Michalis Petropoulos, Vasilis Vassalos | 2002 | 0 | 0 |
450 rows × 11 columns
Content source: anhaidgroup/py_entitymatching